Abstract

Probabilistic databases play a preeminent role in the processing and management of uncertain data. 
Recently, many database research efforts have integrated probabilistic models into databases to support tasks 
such as information extraction and labeling. Many of these efforts are based on batch-oriented inference, which
inhibits a real-time workflow. One important task is entity resolution (ER). ER is the process of determining
records (mentions) in a database that correspond to the same real-world entity. Traditional pairwise ER 
methods can lead to inconsistencies and low accuracy due to localized decisions. Leading ER systems solve 
this problem by collectively resolving all records using a probabilistic graphical model and Markov chain 
Monte Carlo (MCMC) inference. However, for large datasets this is an extremely expensive process. One
key observation is that such an exhaustive ER process incurs a huge up-front cost, which is wasteful in practice
because most users are interested in only a small subset of entities.

In this chapter, we advocate pay-as-you-go entity resolution by developing a number of query-driven 
collective ER techniques. We introduce two classes of SQL queries that involve ER operators: selection-driven
ER and join-driven ER. We implement novel variations of the MCMC Metropolis-Hastings algorithm to
generate biased samples and selectivity-based scheduling algorithms to support the two classes of ER queries. 
Finally, we show that query-driven ER algorithms can converge and return results within minutes over a 
database populated with the extraction from a newswire dataset containing 71 million mentions. 


1 Query-Driven Entity Resolution Introduction 

Entity resolution (ER) is the process of identifying and linking/grouping different manifestations (e.g., mentions, 
noun phrases, named entities) of the same real world object. It is a crucial task for many applications including 
knowledge base construction, information extraction, and question answering. For decades, ER has been studied 
in both database and natural language processing communities to link database records or to perform entity 
resolution over extracted mentions (noun phrases) in text. 

ER is a notoriously difficult and expensive task. Traditionally, entities are resolved using strict pairwise
similarity, which usually leads to inconsistencies and low accuracy due to localized, myopic decisions [39].
More recently, collective entity resolution methods have achieved state-of-the-art accuracy because they leverage
relational information in the data to determine resolution jointly rather than independently. However, it
is expensive to run collective ER based on probabilistic graphical models (GMs), especially for cross-document
entity resolution, where ER must be performed over millions of mentions.

In many previous approaches, collective ER is performed exhaustively over all the mentions in a data set, 
returning all entities. Researchers have developed new methods to perform large-scale cross-document entity 
resolution over parallel frameworks [33, 39]. However, in many ER applications, users are only interested in one
or a small subset of entities. This key observation motivates query-driven ER, an alternative approach to solving 
the scalability problem for ER. 

Compared to previous ER models and algorithms, query-driven techniques in this chapter scale to data sets 
that are in many cases three orders of magnitude larger. Moreover, the ER model in this chapter is general enough 
to take both bibliographic records and mentions extracted from unstructured text. Query-driven ER techniques 
over GMs can also be generalized for other applications to perform query-driven inference. 

This work follows a line of research on implementing ML models inside of databases [18, 22, 38]. Researchers
use factor graphs because this flexible representation works well with other machine learning algorithms. ER is 
ubiquitous and an important part of many analytic pipelines; a probabilistic database implementation is natural. 

In this chapter, we first introduce SQL-like queries that involve ER operations. The ER operator is an
SQL comparison operator (i.e., ER-based equality) that returns true if two mentions map to the same entity.
Factor graphs, a type of GM, are used to model collective entity resolution over mentions extracted from
text. Using this ER-based comparison operator, users can pose selection queries to find all mentions that map to
a single entity, or pose join queries to find mentions that map to the subset of entities that they are interested in
resolving.

Because exhaustive ER is expensive, it is common to use blocking techniques to partition the data set into
approximately similar groups called canopies. Query-driven ER in this chapter differs from blocking in two
important ways: 1) deterministic blocks are replaced by a pairwise distance-based metric, and 2) blocks (or
canopies) are implicit to the query-driven ER data set and do not have to be created in advance. The latter
point, implicit blocking, is realized using a data structure created based on the similarity to a query mention.
This data structure allows parameters to include or remove mentions from the working data set. This property
is similar to the iterative blocking technique [37], which is shown to improve ER accuracy. Such an approach can
dramatically amortize the overall ER cost, which makes it suitable for the pay-as-you-go paradigm in dataspaces [25].


To support ER driven by queries, we develop three sampling algorithms for MCMC inference over graphical
models. More specifically, instead of a uniform sampling distribution, we sample from a distribution that is biased
toward the query. We develop query-driven sampling techniques that maximize the resolution of the target query
entity (target-fixed) and that bias the samples based on the pairwise similarity metric between mentions and query
nodes (query-proportional). We also introduce a hybrid method that performs query-proportional sampling over a
fixed target. We develop two optimizations to the query-proportional and hybrid methods to model the similarity
and dissimilarity between the mentions and the query entity, i.e., attract and repel scores. The first algorithm,
target-fixed, adapts the samples to resolve the query entity. The second algorithm, query-proportional, selects
mentions based on their probabilistic similarity to the query entity. The third algorithm, hybrid, combines the two
approaches. A summary of approaches can be found in Table 

When a user is interested in resolving more than one entity we employ multi-node ER techniques. To 
implement multi-node ER queries, single-node ER techniques may be naively performed iteratively to resolve one 
entity at a time. However, such an algorithm can lead to un-optimized resource allocation if the same number 
of samples is generated for each target entity, or low throughput if one of the entities has a disproportionately 
low convergence rate. To alleviate this problem, we present three multi-query ER algorithms that schedule the 
sample generation among query nodes in order to improve overall convergence rate. 

In summary, the contributions of this chapter are the following: 


• We define a query-driven ER problem for cross-document, collective ER over text extracted from
unstructured data sets;

• We develop three single-node algorithms that perform focused sampling and reduce convergence time
by orders of magnitude compared to a non-query-driven baseline (Section). We develop two influence
functions that use attract and repel techniques to grow or shrink query entities (Section 5.1);

• We develop scheduling algorithms to optimize the overall convergence rate of multi-query ER
(Section 5.2). The best scheduling algorithm is based on the selectivity of the different target entities (Section

The results show that query-driven ER algorithms are a promising method of enabling real-time, ad hoc, ER-based
queries over large data sets. Single-node queries of different selectivity converge to a high-quality entity
within 1-2 minutes over a newswire data set containing 71 million mentions. Experiments also show that such
real-time ER query answering allows users to iteratively refine ER queries by adding context to achieve better
accuracy (Section).


2 Query-Driven Entity Resolution Preliminaries 

In this section we present a foundation of concepts discussed in this chapter. We start with an introduction of 
factor graphs then discuss sampling techniques over this model. Finally, we formally introduce state-of-the-art 
entity resolution approaches and explain the origin. 


2.1 Factor Graphs Graphical models are a formalism for specifying complex probability distributions over
many interdependent random variables. Factor graphs are bipartite graphical models that can capture arbitrary
relationships between random variables through the use of factors. As depicted in Figure 1, links always
connect random variables (represented as circles) and factor nodes (represented as black squares). Factors are
functions that take as input the current setting of the connected random variables and output a positive real-valued
scalar indicating the compatibility of the random variables' settings. The probability of a setting of all the
random variables is a normalized product of all the factors. Intuitively, the highest-probability settings have
variable assignments that yield the highest factor scores.

Figure 1: Three-node factor graph. Circles (random variables) represent mentions and entities. Clouds are added for visual emphasis of entity clusters.

We use factor graphs to represent complex entity resolution relationships. Nodes (random variables) may
correspond to mentions of people, places, and organizations in documents. Nodes also represent the random
variables that correspond to groups of mentions (entities); these nodes are accompanied by clouds in Figure 1.
The factors between mentions and entities give us a sound representation for many possible states. The factor
graph model also gives us a simple mathematical expression of the relationship.

Formally, a factor graph $\mathcal{G} = \langle \mathbf{x}, \psi \rangle$ contains a set of random variables $\mathbf{x} = \{x_i\}_{1}^{n}$ and factors $\psi = \{\psi_i\}_{1}^{m}$.
Each factor $\psi_i$ maps the subset of variables it is associated with to a non-negative compatibility value. The
probability of a setting $\omega$ among the set of all possible settings $\Omega$ occurring in the factor graph is given by a
probability measure:

$$\pi(\omega) = \frac{1}{Z} \prod_{i=1}^{m} \psi_i(x^i), \qquad Z = \sum_{\omega \in \Omega} \prod_{i=1}^{m} \psi_i(x^i),$$

where $x^i$ is the set of random variables that neighbor the factor $\psi_i(\cdot)$ and $Z$ is the normalizing constant.
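
The normalized product of factors can be illustrated on a toy graph. The following is a minimal sketch, not the chapter's model: the three binary variables, the `agree` factor, and its scores are all hypothetical, and the normalizing constant is computed by brute-force enumeration, which is only feasible for very small graphs.

```python
from itertools import product

# Hypothetical toy model: three binary variables and two pairwise factors.
variables = ["x1", "x2", "x3"]

def agree(a, b):
    """Illustrative factor: scores 2.0 when the values match, 0.5 otherwise."""
    return 2.0 if a == b else 0.5

# Each factor is stored with the names of the variables it neighbors.
factors = [(agree, ("x1", "x2")), (agree, ("x2", "x3"))]

def unnormalized(setting):
    """Product of all factor scores for one full assignment."""
    value = 1.0
    for fn, scope in factors:
        value *= fn(*(setting[v] for v in scope))
    return value

def distribution():
    """Enumerate every setting and divide each score by Z."""
    settings = [dict(zip(variables, vals))
                for vals in product([0, 1], repeat=len(variables))]
    scores = [unnormalized(s) for s in settings]
    z = sum(scores)  # the normalizing constant Z
    return [(s, sc / z) for s, sc in zip(settings, scores)]

pi = distribution()
best_setting, best_prob = max(pi, key=lambda pair: pair[1])
```

In this toy model the most probable settings are the ones where all three variables agree, which matches the intuition that high-probability settings maximize the factor scores.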

Querying graphical models produces the most likely setting for the random variables. A query on a factor
graph is defined as a triple $\langle x_q, x_l, x_e \rangle$, where $x_q$ is the set of nodes in question, $x_l$ is a set of latent nodes (entities)
that are marginalized, and $x_e$ is a set of evidence nodes (observed mentions). A query task is a sum over all the
latent variables and a maximization of the query probability. A query over the factor graph is defined as

$$Q(x_q, x_l, x_e, \pi) = \operatorname*{argmax}_{x_q} \sum_{v_l \in x_l} \pi(x_q \cup v_l \cup x_e).$$

To obtain the best setting of the query nodes in question, inference is required.

We refer the reader to our previous work for a detailed discussion of inference over factor graphs and a
derivation of the technique [40].


2.2 Inference over Factor Graphs Several methods exist for performing inference over factor graphs. The
entity resolution factor graph, being pairwise, is dense and highly connected. This property suggests the best
methods for inference are Markov chain Monte Carlo (MCMC) methods; in particular, we use a Metropolis-Hastings
variant [21].

The idea of MCMC-MH is to propose modifications to the current setting and use the model to decide whether
to accept or reject the proposed setting as a replacement for the current setting. When the model is
scored, only the factors touching nodes with changed values (the Markov blanket) need to be recomputed. We
accept or reject changes so the model can iteratively proceed to an optimal setting.

More formally, consider an MCMC transition function $T : \Omega \times \Omega \to [0,1]$ where, given the current setting $\omega$,
we can sample a subsequent setting $\omega'$.

The probability of accepting a transition given a graphical model distribution $\pi$ is:

$$A(\omega, \omega') = \min\left(1, \frac{\pi(\omega')\,T(\omega, \omega')}{\pi(\omega)\,T(\omega', \omega)}\right) \qquad (2.1)$$


Additionally, the intractable partition function $Z$ cancels out, making sample generation inexpensive. This
property allows us to calculate the probability of accepting the next state by simply computing the difference in
score between the next and current state [40].

We say the algorithm converges when a steady state is reached.^ Intelligently sampling next states decreases
the time to convergence. Convergence in MCMC is difficult to verify; we discuss convergence estimation in
Section 6.1.
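
Because Z cancels, the acceptance test only needs unnormalized scores. The sketch below shows one way to compute the Metropolis-Hastings acceptance probability in log space. The function and parameter names are illustrative rather than taken from the chapter, and it assumes that T(w', w) denotes the probability of proposing w' from w.

```python
import math
import random

def acceptance_probability(log_score_current, log_score_proposed,
                           log_t_forward=0.0, log_t_backward=0.0):
    """Return min(1, pi(w') T(w, w') / (pi(w) T(w', w))), computed in log space.

    The log scores are unnormalized, so the partition function never appears.
    log_t_forward is log T(w', w), the proposal from w to w' (assumed convention);
    log_t_backward is log T(w, w'), the reverse proposal.
    """
    log_ratio = ((log_score_proposed - log_score_current)
                 + (log_t_backward - log_t_forward))
    return 1.0 if log_ratio >= 0 else math.exp(log_ratio)

def accept(log_score_current, log_score_proposed, rng=random):
    """Draw the accept/reject decision for a symmetric proposal."""
    return rng.random() < acceptance_probability(log_score_current,
                                                 log_score_proposed)
```

With a symmetric proposal the two transition terms cancel and the decision depends only on the score difference, which is the inexpensive case described above.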


2.3 Cross-Document Entity Resolution Cross-document ER is the problem of clustering mentions that 
appear across independent sets of documents into groups of mentions that correspond to the same real world entity. 
These ER tasks typically assume a set of preprocessed documents and perform linking across documents [4 
The scale of the cross-document ER problem is typically several orders of magnitude more than intra-document 
ER. There are no document boundaries to limit inference scope and all entity mentions may be distributed 
arbitrarily across millions of documents. 

To model cross-document ER, let $\mathcal{M} = \{m_1, \ldots, m_{|\mathcal{M}|}\}$ be the set of mentions in a data set. Each mention
$m_i$ contains a set of attribute-value data points. Let $\mathcal{E} = \{e_1, \ldots, e_{|\mathcal{E}|}\}$ represent the set of entities, where each
$e_i$ contains zero or more mentions. Note that we assume the number of entities is no more than the number
of mentions and no less than 1. Each mention may correspond to a unique entity, or all mentions may correspond
to a single entity.

The baseline method of entity resolution is a straightforward application of the MCMC-MH algorithm. We
show pseudocode for the baseline method in Algorithm 1 [33].


Algorithm 1 The baseline entity resolution algorithm using Metropolis-Hastings sampling
INPUT: A set of unresolved entities $\mathcal{E}$, each with one mention $m$.
INPUT: A positive integer samples.
OUTPUT: A set of resolved entities $\mathcal{E}$.
1: while samples-- > 0 do
2:   $e_i \sim_u \mathcal{E}$
3:   $e_j \sim_u \mathcal{E}$
4:   $m \sim_u e_i$
5:   $\mathcal{E}' \leftarrow$ MOVE($\mathcal{E}, m, e_j$)
6:   if SCORE($\mathcal{E}$) < SCORE($\mathcal{E}'$) then
7:     $\mathcal{E} \leftarrow \mathcal{E}'$
8:   end if
9: end while
return $\mathcal{E}$


^ We refer to the literature for a more detailed description of convergence [40].

Algorithm 1 takes as input a set of entities $\mathcal{E}$ and samples, which is the number of iterations of the algorithm
or a function to estimate convergence. The algorithm samples two entities from the entity set and moves one
random mention into the other entity. After the move, the algorithm checks for an improvement in the overall
score of the model. If the model score improves, the changes are kept; otherwise the proposed changes are ignored.
The SCORE function sums the weights of all the edges in the given entity to obtain a value for the model. This 



In this chapter, we use two methods of blocking. First, we use an approximate string match over all the
mentions in the database. To perform the approximate string filter we use a q-grams technique over all the
mentions in the database. This method creates an inverted index of the q-grams of each mention in the database so a query
can be performed to look for all words that contain a sufficient number of matching q-grams. This gives us a fast
high-recall filter over many records.

The second is an implicit blocking structure created by computing the influence a query node has on the
other nodes in the data set (see Section 5.1). This method uses an estimate of the distance between the query
nodes and the candidate mentions to prioritize samples.
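
The first blocking method can be sketched as a q-gram inverted index. This is a minimal illustration under assumptions, not the chapter's implementation: the gram length q = 3, the padding character, the `min_shared` threshold, and all function names are hypothetical choices.

```python
from collections import defaultdict

def qgrams(text, q=3):
    """Return the set of character q-grams of a padded, lowercased string."""
    padded = "#" * (q - 1) + text.lower() + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}

def build_index(mentions, q=3):
    """Inverted index: map each q-gram to the ids of mentions containing it."""
    index = defaultdict(set)
    for mention_id, text in mentions.items():
        for gram in qgrams(text, q):
            index[gram].add(mention_id)
    return index

def candidates(query, index, min_shared=3, q=3):
    """Ids of mentions that share at least min_shared q-grams with the query."""
    counts = defaultdict(int)
    for gram in qgrams(query, q):
        for mention_id in index.get(gram, ()):
            counts[mention_id] += 1
    return {mid for mid, n in counts.items() if n >= min_shared}

# Mentions taken from the running example; the ids are illustrative keys.
mentions = {"m1": "NY Giants", "m2": "Bronx Bombers", "m3": "New York Giants",
            "m4": "Yankees", "m5": "Brooklyn Dodgers", "m6": "The Yanks"}
index = build_index(mentions)
canopy = candidates("New York Yankees", index)
```

A purely string-based filter like this one cannot retrieve an alias such as 'Bronx Bombers', which shares almost no q-grams with the query. The threshold therefore trades canopy size against recall.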


3 Query-Driven Entity Resolution Problem Statement 

In this section, we formally define the problem of query-driven ER. We use an SQL-like formalism to model 
traditional and query-driven entity resolution. 

In a probabilistic database, let a Mentions table contain all the extracted mentions from a text corpus. Its
column entity^p represents the probabilistic latent entity labels; they contain a mapping, but that mapping may not
represent the current state. The People table holds a watchlist of mentions and relevant contextual information.
The context column is an abstract placeholder for text data or richer schemas. This model only assumes there
is a master column; the realization of the context column is flexible and implementation dependent.


Mentions(docID, startpos, mention, entity^p, context)
People(peopleID, mention, entity^p, context)


We also define a user-defined function coref_map that performs maximum a posteriori (MAP) inference on
the latent entity^p random variables. The function takes two instances of mentions, with at least one being from a
probabilistic table such as the Mentions table. When the query is executed, the coref_map function returns true if
the mentions referenced are coreferent. In the following, we describe the traditional exhaustive ER task as well as the
single- and multi-node query-driven ER queries.

Exhaustive The goal of traditional entity resolution is to cluster all mentions in a data set. All the mentions
clustered inside each entity are coreferent with each other and not coreferent with any mention that is a part of a
different entity cluster. The process of exhaustive ER can be modeled as a self-join database query where each
mention is grouped into coreferent clusters. In Algorithm 2 we create a view displaying the results of a resolved
query.


Algorithm 2 Example exhaustive entity resolution query that creates a database view


CREATE VIEW CorefView AS
SELECT m.docID, m.startpos, m.mention, m2.mention
FROM Mentions m, Mentions m2
WHERE coref_map(m.*, m.entity^p, m2.mention, m2.context)


To obtain unique entity clusters, we can perform an aggregation query over the CorefView. In Figure we 
see an example of the result of traditional entity resolution. 


^ Given a set $X$, the function $x \sim_u X$ makes a uniform sample from the set $X$ into a variable $x$.


















Figure 2: A possible initialization for entity resolution 


Single-node Query In the ER task, we may only be interested in the mentions of one entity. We represent
this entity with a template mention, or query node, q. Single-node entity resolution is modeled as a selection
query with a where-clause that includes the template mention q and returns only the mentions that are members
of the entity cluster that contains the template mention. Given a template mention q and its context q.context,
Algorithm 3 shows the single-node query, based on an example in Section 4.2.


Algorithm 3 Single query-node driven entity resolution query

SELECT m.docID, m.startpos, m.mention
FROM Mentions m
WHERE coref_map(m.*, m.entity^p, q, q.context)


Here we add parameters to the coref_map function that contain the specific query and its context. It performs
ER over the Mentions table but only returns an affirmative value if the labels for the entity cluster match the query
node. For example, if the template mention q were ‘Mark Zuckerberg’ and the query context were keywords such
as ‘facebook’ and ‘ceo’, the only returned mentions would be those that represent Mark Zuckerberg, the Facebook
founder. This is similar to a ‘facebook’ approximate string search. The emphasis of this chapter is optimizing
this function so that, while performing ER, we perform less work compared to an exhaustive query.

Multi-Query In many cases, a user may be interested in a watchlist of entities. A watchlist is a subset of
the larger Mentions set. This is common for companies looking for mentions of their products in a data set. In this
case, mentions are only clustered with the entities represented in the watchlist. Algorithm 4 is an example of a
join query between the Mentions table and the People table.


Algorithm 4 Multi-query between the People watchlist table and the full Mentions set

SELECT m.docID, m.startpos, m.mention, q
FROM Mentions m, People q
WHERE coref_map(m.*, m.entity^p, q, q.context)


This query combines a watchlist of terms and performs ER with respect to the specific examples in the
watchlist. The multi-query method uses scheduling to perform inference, or a fuzzy equal, over each mention. In
Section 4.3 we propose scheduling algorithms so that multi-query ER gracefully manages multi-query workloads.

4 Query-Driven Entity Resolution Algorithms 

Query-driven ER is an understudied problem; in this section we describe our approach to query-driven ER with 
one entity (single-query ER) and with multiple entities (multi-query ER). First, we give a graphical intuition of 
query-driven ER algorithms. 










Figure 3: The correct entity resolution for all mentions 



Figure 4: The entity containing q is internally coreferent; the other entities are not correctly resolved 


4.1 Intuition of Query-Driven ER In this section, we remind the reader of the query-driven ER task with
a formal definition. Each ER task is given a corpus and a set of entity mentions $\mathcal{M} = \{m_1, \ldots, m_{|\mathcal{M}|}\}$ extracted
from it. A user may supply a set of query nodes $Q = \{q_1, \ldots, q_{|Q|}\}$. Each $q_i$, also called a query template,
may be a member of $\mathcal{M}$ or a manually declared mention that is appended to the set of mentions. For each node
$q_i \in Q$, the task of ER is to compute the set of entities $E = \{e_1, \ldots, e_{|Q|}\}$ that only contain mentions that are
coreferent with the query node,

$$e_i = \{m_j \mid m_j \in \mathcal{M}, \mathrm{QDER}(\mathcal{M}, m_j, q_i)\}.$$

In Section 4.2 we describe implementations of the QDER algorithm for $|Q| = 1$. In Section 4.3 we describe
techniques for scheduling the ER task for the general case of $|Q| > 1$.

Fundamentally, the ER algorithm generates a graphical model and makes new state proposals (jumps) to
reach the best state (see Section). The query-driven algorithms in this section use a query node to facilitate
more sophisticated jumps. By making smart proposals we expect faster convergence to an accurate state. As a 
note to the reader, a summary of query-driven algorithms can be found in Table 

Figures 2 to 4 show an initial configuration and acceptable query-driven entity resolution solutions. An example
initial state of this algorithm is shown in Figure 2; each mention is initially assigned to a separate entity.
Alternatively, the model may be initialized randomly, in an arrangement from a previous entity resolution
output, or with all mentions in one entity. Figure 3 is the full resolution for the data set; each mention is correctly
assigned to its entity cluster. Figure 4 is a result that was resolved with query-driven methods and is a partially
resolved data set. Because the entity containing the query node is completely resolved, the solution is acceptable.


4.2 Single-Node ER Single-node ER algorithms are the class of algorithms that resolve a single query-node 
as discussed in Section In particular, the target-fixed ER algorithm aims to focus a majority of the proposals 
on resolving the query entity. The algorithm fixes the query node as the target entity and then randomly selects
a source node to merge into the entity of the target query node. Because this type of importance sampling
focuses on building the query entity, the query entity should be resolved faster than if we sampled each entity
uniformly.

A query-driven ER algorithm that only selects the query node as the target entity during sampling will create
errors because such an algorithm is unable to remove erroneous mentions from the query entity. To prevent these
errors, we allow the algorithm to occasionally back out of poor decisions; that is, it makes non-query-specific
samples. Shown in Algorithm 5, target-fixed entity resolution adapts Algorithm 1, but it allows parameters to
specify the proportion of time the different sampling methods are selected.

In addition to the input entities $\mathcal{E}$ from Algorithm 1, target-fixed entity resolution takes as input a query
node q. The output of the algorithm is a resolved query entity and other partially resolved entities.

For each sampling iteration the algorithm can make one of two decisions. The sampler may propose to merge a
random source node that is not already a member of the query entity into the target query entity. Alternatively,
the algorithm merges a random node with a random entity.

In the first branch of the conditional, the algorithm takes a uniform sample from the list of entities. If the sampled entity is the
same as the query entity, it tries again and samples a distinct entity. A node is drawn from this entity. The
probability of this block being entered is $\tau_q$. Lines 7 to 10 are entered with probability $(1 - \tau_q)$. This block
performs a random entity assignment in the same manner as Algorithm 1. This block offsets the aggressive nature
of the target-fixed algorithm by probabilistically backing out of any bad merges. Finally, the block starting at
line 12 scores the new arrangement and accepts it if it improves the model score. We discuss parameter
settings in Section 5.3.

Example Take the synthetic mention set $\mathcal{M}$ shown in Table 1 and a query node q, the baseball team
‘New York Yankees’, in Table 2. This is the result of the approximate match of query q over a larger data set
(blocking). The mentions of $\mathcal{M}$ may be initialized by assigning each mention to its own entity. After a successful
run of traditional entity resolution the set of entity clusters is

$$\{(q, m_2, m_4, m_6), (m_1, m_3), (m_5)\}.$$

For the query-driven scenario, the only entity we are interested in is $(q, m_2, m_4, m_6)$. Each mention in this query entity
is an alias for the ‘New York Yankees’ baseball team. The other two entities represent the ‘New York Giants’
football team and the ‘Brooklyn Dodgers’ baseball team, respectively.




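Algorithm 5 Target-fixed entity resolution algorithm
Input: A query node $q$.
       A set of entities $\mathcal{E}$, each with one mention $m$.
       A positive integer samples.
Output: A set of resolved entities $\mathcal{E}'$.
1: $\mathcal{E}' \leftarrow \mathcal{E} \cup q$
2: while samples-- > 0 do
3:   if random() < $\tau_q$ then
4:     $e_i \sim_u \mathcal{E}'$
5:     $e_j \leftarrow q$.entity
6:     $m \sim_u e_i$
7:   else
8:     $e_j \sim_u \{e \mid \exists e, e \in \mathcal{E}', e \neq q.\mathrm{entity}\}$
9:     $e_i \sim_u \{e \mid \exists e, e \in \mathcal{E}', e \neq e_j\}$
10:    $m \sim_u e_i$
11:  end if
12:  $\mathcal{E}'' \leftarrow$ MOVE($\mathcal{E}', m, e_j$)
13:  if SCORE($\mathcal{E}'$) < SCORE($\mathcal{E}''$) then
14:    $\mathcal{E}' \leftarrow \mathcal{E}''$
15:  end if
16: end while
return $\mathcal{E}'$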


Table 1: Mention set $\mathcal{M}$ from a corpus

id       Mention
$m_1$    NY Giants
$m_2$    Bronx Bombers
$m_3$    New York Giants
$m_4$    Yankees
$m_5$    Brooklyn Dodgers
$m_6$    The Yanks


Table 2: Example query node q

id       Mention
q        New York Yankees







The target-fixed algorithm attempts to merge nodes with the query entity one mention at a time, and the
merge is accepted if it improves the score of the overall model. We can see in the example that a merge of $m_1$
and $m_3$ may improve the overall model because they have similar keywords but one refers to the query entity and
the other to a different football team. The target-fixed algorithm can correct this type of error by probabilistically
backing out of errors, moving mentions in the query entity to a new entity as shown in lines 7 to 10 of
Algorithm 5.
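
The target-fixed sampler can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the chapter's implementation. The `affinity` function is a hypothetical stand-in for learned factor weights, the query node itself is never moved, and a move is kept only when the score strictly improves, as in the pseudocode. Setting `tau_q` close to 0 approximates the uniform baseline.

```python
import random

def score(entities, affinity):
    """SCORE: sum of pairwise affinities between mentions sharing an entity."""
    total = 0.0
    for members in entities:
        items = sorted(members)
        for i, a in enumerate(items):
            for b in items[i + 1:]:
                total += affinity(a, b)
    return total

def target_fixed_er(mentions, query, affinity, samples=500, tau_q=0.9, seed=0):
    """Greedy target-fixed sampler; entity 0 always holds the query node."""
    rng = random.Random(seed)
    entities = [{query}] + [{m} for m in mentions]  # singleton initialization
    current = score(entities, affinity)
    for _ in range(samples):
        if rng.random() < tau_q:
            # Target-fixed proposal: pull a mention into the query entity.
            dst = 0
            sources = [i for i, e in enumerate(entities) if i != 0 and e]
        else:
            # Random proposal into a non-query entity. The source may be the
            # query entity, which lets the sampler back out of a bad merge.
            if len(entities) < 2:
                continue
            dst = rng.randrange(1, len(entities))
            sources = [i for i, e in enumerate(entities)
                       if i != dst and e - {query}]
        if not sources:
            continue
        src = rng.choice(sources)
        mention = rng.choice(sorted(entities[src] - {query}))
        proposal = [set(e) for e in entities]
        proposal[src].discard(mention)
        proposal[dst].add(mention)
        proposed = score(proposal, affinity)
        if proposed > current:  # keep the move only if the score improves
            entities, current = proposal, proposed
    return entities[0], entities

# Illustrative oracle affinity: +1 for mentions of the same team, -1 otherwise.
team = {"q": "yankees", "m2": "yankees", "m4": "yankees", "m6": "yankees",
        "m1": "giants", "m3": "giants", "m5": "dodgers"}

def affinity(a, b):
    return 1.0 if team[a] == team[b] else -1.0

query_entity, all_entities = target_fixed_er(
    ["m1", "m2", "m3", "m4", "m5", "m6"], "q", affinity)
```

With this oracle affinity, every accepted move keeps the entities pure, so the query entity can only contain Yankees mentions. Whether it grows to the full cluster within a given number of samples depends on the random proposals.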

4.3 Multi-query ER A user may want to resolve more than one query entity; that is, she may be interested
in resolving a watchlist of entities over the data set. To support multiple queries, we first merge the canopies of each
query node in the watchlist to obtain a subset of the full graphical model containing only the nodes similar to the
query nodes. To resolve the entities we can use query-proportional methods iteratively over each query node. We
define two classes of schedules, namely, static and dynamic.

Static schedules are formulated before sampling, while dynamic schedules are updated in response to estimated
convergence. The two static schedules we develop are random and selectivity-based. In random scheduling each
query node from the watchlist is selected in a round-robin style. Selectivity-based scheduling is a method of
ordering multi-query samples to schedule proposals in proportion to the selectivity of the query node. Selectivity,
in this case, is defined as the number of mentions retrieved using an approximate match of the data set, or the
query node’s contribution to the total new graphical model. For example, the selectivity of our query node q in
Table 2 is simply the size of $\mathcal{M}$, shown in Table 1.

The random scheduling method performs well if all query nodes have similar selectivity. Otherwise,
if the selectivity of the query nodes varies, one query node may require more sampling than the others. If
one query node needs a lot of samples to converge, the whole process may take a long time to complete, and
cycles may be wasted on other nodes that have already converged.

In addition to scheduling samples in proportion to their selectivity, we can schedule samples dynamically,
depending on the progress of each query entity. To perform dynamic scheduling we need to know how each
query entity is progressing towards convergence. To estimate the running convergence we do not use standard
techniques in the literature because scheduling needs to occur before the model is close to convergence [10]. Instead,
we estimate the convergence by measuring the fraction of accepted samples over the last N samples of each query
in the watchlist. The two dynamic scheduling algorithms are closest-first and farthest-first. In
closest-first we queue up the query node that has the lowest positive fraction of accepted samples over the
last N proposals. This scheduling method performs inference for the node that is closest to being resolved so it
can move on to other nodes. Alternatively, the farthest-first algorithm schedules the node that has the highest
fraction of accepted samples, that is, the node farthest from convergence. This scheduling algorithm makes each
query entity progress evenly.
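
The static and dynamic schedules can be sketched as follows. All names here are hypothetical, and two details are assumptions rather than statements from the chapter: the selectivity-based schedule is realized by weighted random draws, and a query node with no recorded proposals is treated as far from convergence.

```python
import random
from collections import deque

def selectivity_schedule(selectivity, total_samples, seed=0):
    """Static schedule: draw query nodes in proportion to their selectivity."""
    rng = random.Random(seed)
    nodes = list(selectivity)
    weights = [selectivity[n] for n in nodes]
    return rng.choices(nodes, weights=weights, k=total_samples)

class AcceptanceTracker:
    """Tracks accept/reject outcomes of the last N proposals per query node."""

    def __init__(self, nodes, window=100):
        self.history = {n: deque(maxlen=window) for n in nodes}

    def record(self, node, accepted):
        self.history[node].append(1 if accepted else 0)

    def rate(self, node):
        """Fraction of accepted proposals in the window (1.0 if none yet)."""
        h = self.history[node]
        return sum(h) / len(h) if h else 1.0

    def closest_first(self):
        """Node with the lowest positive acceptance rate, or None."""
        active = [n for n in self.history if self.rate(n) > 0]
        return min(active, key=self.rate) if active else None

    def farthest_first(self):
        """Node with the highest acceptance rate."""
        return max(self.history, key=self.rate)
```

A driver loop would ask the tracker for the next query node, run one proposal for it, and call `record` with the outcome. Nodes whose rate drops to zero are treated as converged and skipped by `closest_first`.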


5 Optimization of Query-Driven ER 

The previous ER techniques aggressively attempt to resolve the query entity. However, if the query node is
not representative of the query items, target-fixed ER can lead to undesirable results. We do
not explore this trade-off; we assume users can select representative query nodes. In this section, we introduce 
optimizations to create approximate query-driven samples based on the query node. We first discuss the influence 
function that is used to make query-driven proposals. We then discuss the attract and repel versions of the 
influence function followed by two new algorithms. We end with implementation details and a summary of our 
query-driven algorithms. 


5.1 Influence Function: Attract and Repel To retrieve nodes from a graphical model that are similar to
a query node, we employ the notion of influence. Our assumption is that nodes that are similar have a high
probability of being coreferent. An influence trail score between two nodes in a graphical model can be computed
as the product of factors along their active trail, as defined in the literature [40]. For a node $m_i \in \mathcal{M}$ and the query
node $q \in \mathcal{M}$, the influence of $m_i$ on the query node is defined as:

$$I(m_i, q) = \sum_{j \in \mathcal{F}} w_j \psi_j(m_i, q),$$

where $\mathcal{F}$ is the world of pairwise features, and the feature weight and log-linear function are, respectively, $w_j$ and
$\psi_j$. The influence function $I$ is an implementation of this trail score.





The influence function takes a set of entities (or the equivalent GM) and a query node q as parameters.
The parameters to an influence function can be over the whole database or a canopy. Over several invocations
of the function, I returns mentions from the graphical model with a frequency proportional to their influence
on q. If a mention has little or no influence, the influence acts as a blocking function, infrequently returning the
mention. Recall that influence is the active trail distance to the query node. To implement the influence function
we build a data structure based on an algorithm by Vose [36], hereafter referred to as a Vose structure.
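
For readers unfamiliar with Vose's alias method, the following Python sketch shows one way such a structure could be built and sampled. It is an illustration under stated assumptions, not the chapter's implementation: `build_vose` and `sample_vose` are hypothetical names, and the weights stand in for precomputed influence scores of each mention on the query node.

```python
import random


def build_vose(weights):
    """Build alias tables (prob, alias) for O(1) weighted sampling."""
    n = len(weights)
    total = float(sum(weights))
    scaled = [w * n / total for w in weights]   # mean of scaled weights is 1
    prob = [0.0] * n
    alias = list(range(n))
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s] = scaled[s]                     # keep s with this probability
        alias[s] = l                            # otherwise return l
        scaled[l] = (scaled[l] + scaled[s]) - 1.0
        (small if scaled[l] < 1.0 else large).append(l)
    for i in large + small:                     # leftovers are (numerically) 1
        prob[i] = 1.0
    return prob, alias


def sample_vose(prob, alias, rng=random):
    """Draw one index with probability proportional to its original weight."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]
```

The two tables correspond to one floating point array and one integer array per query node. Construction is linear in the number of mentions and each draw takes constant time, which matters because the structure is consulted on every proposal.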


The input mentions to the blocking algorithms may result in high or low quality canopies. A high quality
canopy means most of the mentions in the canopy are associated with the query node. Low quality canopies,
which are more common, correspond to only a small number of mentions being associated with the query node.
When initializing query-driven algorithms, the canopy quality is important for determining which algorithm to use.

The attract method initializes each mention in the canopy in its own entity, and then mentions are merged
until convergence. The target-fixed algorithm discussed in Section 4.2 is explained using this method. The
attract method works well for low quality canopies, or canopies that require a small number of items to merge.
Conversely, the repel method works well with high quality canopies or when most items in a canopy belong to
the query entity.

The repel method initializes all mentions in the canopy into a single entity. Then proposals are made to
remove mentions from the entity so we are left with only the nodes in the query entity. We discuss this method
using the hybrid algorithm in Section 5.3. To build an influence function for the repel method we can use the
same method; we only need to normalize and invert the influence scores. We refer to this as co-influence, or $\bar{I}$.
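
The text does not spell out how the scores are inverted. The Python sketch below shows one plausible reading, in which each normalized score p is replaced by 1 - p and the result is renormalized. The function name `co_influence` and this particular inversion are assumptions made for illustration.

```python
def co_influence(scores):
    """Turn influence scores into co-influence weights that sum to 1."""
    n = len(scores)
    total = float(sum(scores))
    if n < 2 or total <= 0.0:
        # Nothing to invert: fall back to a uniform distribution.
        return [1.0 / n] * n if n else []
    normalized = [s / total for s in scores]   # influence as a distribution
    inverted = [1.0 - p for p in normalized]   # low influence -> high weight
    z = sum(inverted)                          # equals n - 1
    return [v / z for v in inverted]
```

The resulting weights can be fed to the same sampling structure used for influence, so that mentions least related to the query node are proposed for removal most often.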


5.2 Query-proportional ER. In the query-proportional sampling algorithm, on every iteration, the source
mention and target entity are selected in proportion to their distance to the query entity. Instead of focusing solely
on the query entity, this algorithm prioritizes samples using a measure that represents the probability of a mention
being coreferent with the query entity.

That is, each node p in the graphical model G is selected based on the active trail between itself and the query node
q. This algorithm merges nodes that are similar to the query node with an increased frequency.

Before query-proportional sampling, a data structure for I is created. The influence structure I takes a
query node q and the global graphical model E, then returns a sampled mention. As I is called multiple times, the
distribution of the nodes returned is proportional to their influence. Algorithm 6 describes the query-proportional
algorithm.


Algorithm 6 Query-proportional algorithm
Input: A query node q to drive computation.
       A set of entities E, each with one mention m.
       A positive integer samples.
       A function I that samples nodes from entities according to their influence on a mention.
Output: A set of resolved entities E'.
1: E' ← E ∪ q
2: while samples-- > 0 do
3:   m1 ← I(E', q)
4:   m2 ← I(E', q)
5:   E'' ← MOVE(E', m1, m2.entity)
6:   if score(E') < score(E'') then
7:     E' ← E''
8:   end if
9: end while
   return E'


For each iteration, the algorithm selects mentions using the influence function (lines 3 and 4). Then, one
mention m1 is moved into the entity of m2. Mentions m1 and m2 have a higher probability of being coreferent
and therefore a higher probability of a merge occurring in the query entity compared to random selections as in
the baseline algorithm. As a corollary, the influence sampling property creates many small entities that are similar to the
query entity.
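
The loop of the query-proportional algorithm can be sketched in Python as follows. This is a simplified illustration, not the Factorie implementation: the function name `query_proportional`, the `influence` dictionary (assumed to hold each mention's precomputed influence on the query node) and the caller-supplied `score` function are hypothetical, and `random.choices` stands in for the Vose structure.

```python
import random


def query_proportional(mentions, influence, samples, score, rng=random):
    """Greedy query-proportional sampling over a mention -> entity-id map.

    influence[m] is the (assumed precomputed) influence of mention m on the
    query node; score(assignment) returns the model score of a clustering.
    """
    assignment = {m: i for i, m in enumerate(mentions)}  # one entity each
    weights = [influence[m] for m in mentions]
    current = score(assignment)
    for _ in range(samples):
        # Draw source and target mentions in proportion to their influence.
        m1, m2 = rng.choices(mentions, weights=weights, k=2)
        if assignment[m1] == assignment[m2]:
            continue  # moving m1 would change nothing
        proposal = dict(assignment)
        proposal[m1] = assignment[m2]  # MOVE(E', m1, m2.entity)
        proposed = score(proposal)
        if current < proposed:  # keep only improving moves
            assignment, current = proposal, proposed
    return assignment
```

Copying the whole assignment for every proposal keeps the sketch short; a production sampler would instead score only the factors touched by the move.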

During query-proportional sampling more entities that are similar to the query node are created. Some of 
the mentions created in intermediate entities during query-proportional sampling will move to the query entity. 
This is a big advantage when performing entity-to-entity merges (as opposed to mention to entity merges). In 
this chapter, we do not investigate this extension to the algorithm. 

5.3 Hybrid ER. The best of both the target-fixed and query-proportional algorithms can be combined to create
a hybrid algorithm. Like the target-fixed algorithm, the hybrid method aggressively fixes the target as the query
entity. The hybrid method also chooses its source node using the influence function in the same manner as the
query-proportional algorithm.

Algorithm 7 shows the hybrid algorithm using the repel method. With probability T_q the algorithm chooses
a mention using the repel method (I) and moves it to an entity that is not the query entity. This is the opposite
of merging a node into the query entity. The corresponding pseudocode is the first branch of the conditional in Algorithm 7.


Algorithm 7 Hybrid-Repel algorithm
Input: A set of entities E, where one contains all the mentions m and the others are empty.
       A positive integer samples.
       A query node q.
       A function I that samples nodes from entities according to their influence on a mention.
Output: A set of resolved entities E'.
 1: E' ← E ∪ q
 2: while samples-- > 0 do
 3:   if random() < T_q then
 4:     m ← I(E', q)
 5:     e_i ← {e | ∃e, e ∈ E', e ≠ q.entity}
 6:   else
 7:     e_i ∼ E'
 8:     e_j ← {e | ∃e, e ∈ E', e ≠ e_i}
 9:     m ∼ e_j
10:   end if
11:   E'' ← MOVE(E', m, e_i)
12:   if SCORE(E') < SCORE(E'') then
13:     E' ← E''
14:   end if
15: end while
    return E'
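
A compact Python sketch of the hybrid-repel loop follows. It is illustrative only: `hybrid_repel`, the `co_influence` dictionary (assumed to hold precomputed co-influence weights with respect to the query node), the caller-supplied `score` function and the default `t_q=0.9` are assumptions. The sketch also never moves the query node itself and assumes at least two mentions.

```python
import random


def hybrid_repel(mentions, q, samples, co_influence, score, t_q=0.9, rng=random):
    """Greedy hybrid-repel sampling; returns a mention -> entity-id map."""
    ids = list(range(len(mentions)))      # entity 0 holds everything at first
    assignment = {m: 0 for m in mentions}
    weights = [co_influence[m] for m in mentions]
    current = score(assignment)
    for _ in range(samples):
        if rng.random() < t_q:
            # Repel step: pick a mention by co-influence and a target entity
            # other than the query entity.
            m = rng.choices(mentions, weights=weights, k=1)[0]
            target = rng.choice([e for e in ids if e != assignment[q]])
        else:
            # Back-off step: random target entity and a random mention that
            # is currently outside that entity.
            target = rng.choice(ids)
            others = [x for x in mentions if assignment[x] != target]
            if not others:
                continue
            m = rng.choice(others)
        if m == q or assignment[m] == target:
            continue  # keep the query node in place; skip no-op moves
        proposal = dict(assignment)
        proposal[m] = target              # MOVE(E', m, e_i)
        proposed = score(proposal)
        if current < proposed:            # keep only improving moves
            assignment, current = proposal, proposed
    return assignment
```

Because only improving moves are kept, the back-off branch is what lets the sampler escape configurations that the repel proposals alone cannot improve.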


5.4 Implementation Details. The previous algorithms described single process sampling over the set of
mentions. The multi-query methods are modeled as several interwoven sequential single-node ER processes.
In this section, we describe our implementation of the hybrid algorithm over a parallel database management
system.

An independent Vose structure (I, Section 5.1) is created for each query node in the query set. The creation of the
Vose structures is parallelized across query nodes. When the number of query nodes increases, the Vose structures demand
more memory from the system. Each Vose structure contains arrays of type double precision and unsigned int.
The space for the structures is $O(|Q| \cdot |M|)$, where |Q| is the number of query nodes in the query and |M| is the
number of mentions in the corpus. The Vose structure is accessed on every sample and needs to be in memory.
To increase scalability, one could store the full sets of precomputed samples and serialize the Vose structures to
disk, but that is not explored here [19].

Sampling over the query nodes for each algorithm can also be performed in parallel. In our method, a thread
selects a query node using a random schedule as described in Section 4.3. The system will use the Vose structure
associated with the query node to set up a proposal move. The system attempts to obtain locks for both entities
involved in the proposal. If the system is unable to obtain a lock on either of the two entities, the system will back
out and resample new entities. When the number of query nodes is small, the query-driven algorithms experience
a lot of contention at the entities containing the query nodes. In these circumstances, the system will back out and
either restart the proposal process or attempt a baseline proposal. This avoids waiting for locked entities and
keeps the sampling process active. In Section 6.6 we demonstrate the parallel hybrid method over a large data
set.
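
The back-out behaviour can be illustrated with non-blocking lock acquisition. The Python sketch below is an assumption about how such a scheme might look and does not reproduce the DataPath implementation; `try_lock_pair`, `release_pair` and `propose_with_backoff` are hypothetical names.

```python
import threading


def try_lock_pair(lock_a, lock_b):
    """Try to take both entity locks without blocking; True on success."""
    if lock_a is lock_b:
        return lock_a.acquire(blocking=False)
    if not lock_a.acquire(blocking=False):
        return False
    if not lock_b.acquire(blocking=False):
        lock_a.release()  # back out so no other thread waits on us
        return False
    return True


def release_pair(lock_a, lock_b):
    """Release the locks taken by a successful try_lock_pair call."""
    lock_a.release()
    if lock_b is not lock_a:
        lock_b.release()


def propose_with_backoff(entity_locks, source, target, apply_move):
    """Apply a move only if both entity locks are free; otherwise back out."""
    lock_a, lock_b = entity_locks[source], entity_locks[target]
    if not try_lock_pair(lock_a, lock_b):
        return False  # caller resamples or falls back to a baseline proposal
    try:
        apply_move(source, target)
        return True
    finally:
        release_pair(lock_a, lock_b)
```

Since no thread ever blocks while holding a lock, this pattern cannot deadlock, at the cost of occasionally discarding a proposal under contention.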


5.5 Algorithms Summary Discussion. The target-fixed, query-proportional and hybrid algorithms are modifications of the proposal jumps found in
the baseline algorithm. Table 3 describes the proposal process for each algorithm by its preferred jump method.

Table 3: Summary of algorithms and their most common methods for proposal jumps

Algorithm            Source         Target
Baseline             random         random
Target-Fixed         random         fixed
Query-Proportional   proportional   proportional
Hybrid               proportional   fixed


The target-fixed algorithm builds the query entity by aggressively proposing random samples to merge into
the query entity. The query-proportional algorithm uses an influence function to ensure its samples are mostly
related to the query node. The hybrid algorithm mixes the aggressiveness of the target-fixed algorithm with the intelligent
selection of the source node found in the query-proportional method.

After choosing the correct algorithm, a user needs to have a well-trained model with several features. An
advantage of using query-proportional techniques is that, because so little sampling is required, we can interactively
test query accuracy. We can also add context or keywords that were discovered from a previous run of the
algorithm. This interactive querying workflow will help improve accuracy, which we experimentally verify in
Section 6.5.

Parameter settings. The algorithms take several parameters that affect performance. While not studied in
this chapter, parameter settings are robust to change, making parameter selection simple. The first is the number
of proposals (samples). This number can be a function of the size of the data set. Each query node should have
the opportunity to be merged into an entity more than once.

The value T_q lies in [0.0, 1.0] and represents how often to perform the main type of sampling. This value
should be set to a high value, such as 0.9 for accept algorithms. With probability 1 - T_q the algorithms back off to random
samples to improve mixing. This value is lowered to counter some of the aggression of the query-driven proposals.
The parallel experiments use T_q = 1 and back out when there is contention in the threads.

In statistics, a negative binomial function is used to model the number of trials it takes for an event to be a
success. We can also use a negative binomial function as a decay function for the output of the influence function.
We use this function because we want values that are most similar (lowest score) to be sampled more often. We
set the r value, or number of failures, for the negative binomial function to 1. We set the p value, or the probability
of each success, to a value close to 0.05.

In the multi-query ER algorithms we run inference for k steps before we look to change the query entity.
In our experiments we choose a k of 500, and an increasing value from two to 100 thousand in the parallel
experiments.


6 Query-Driven Entity Resolution Experiments 

In this section, we describe the implementation details, the data sets and our experimental setup. Next, we discuss 
our hypotheses and four corresponding experiments. We then finish with a discussion of the results. 

Implementation. We developed the algorithms described in the previous sections in Scala 2.9.1 using the Factorie
package. Factorie is a toolkit for building imperatively defined factor graphs [27]. This framework allows a
templated definition of the factor graph to avoid fully materializing the structure. The training algorithms are
also developed using Factorie. The algorithms for canopy building and approximate string matching are developed
inside PostgreSQL 9.1 and Greenplum 4.1 using SQL, PL/pgSQL and PL/Python. Inference is performed
in-memory on an Intel Core i7 processor with 3.2 GHz, 8 cores and 12 GB of RAM. The approximate string
matching on Greenplum is performed on an AMD Opteron 6272 32-core machine with 64 GB.

The parallel experiments were developed entirely in a parallel database, DataPath. DataPath is installed
on a 48-core machine with 256 GB.

Data sets. The experiments use three data sets. The first is the English newswire articles from the Gigaword
Corpus; we refer to this as the NYT Corpus [15]. The second is a smaller but fully labeled Rexa data set. Because
it is fully labeled, it allows us to run the more detailed micro benchmarks. The NYT Corpus contains 1,655,279
articles and 29,866,129 paragraphs from the years 1994 to 2006. We extracted a total of 71,433,375 mentions
using the Natural Language Toolkit named entity extraction parser. Additionally, we compute general statistics
about the corpus including the term and document frequency and tf-idf scores for all terms. We manually labeled
mentions for each query over the NYT data set.

The second data set, Rexa, is citation data from a publication search engine named Rexa. This data set
contains 2454 citations and 9399 authors, of which 1972 are labeled. We perform experiments on the Rexa corpus
because it is fully labeled, unlike the NYT Corpus. The Rexa corpus is smaller in total size but it has average
sized canopies.

The third data set is the Wikilinks Corpus [34], the largest labeled corpus for entity resolution that we could find
at the time of development. It contains 40 million mentions and 3 million entities that were extracted from the
web and truthed based on web anchor links to Wikipedia pages. We loaded a million mentions onto DataPath to
demonstrate the parallel capabilities.


6.1 Experiment Setup. Table 4 lists the features and the weights for each feature.

Features. Features that look for similarity between mention nodes are called affinity features and they are
given positive weights. Features that look for dissimilarity between mention nodes are called repulsion factors
and they are given negative weights. We implement three classes of features: pairwise token features, pairwise
context features and entity-wide features. Pairwise features directly compare token strings on attributes such
as equality or matching substrings. Context features compare the information surrounding the mention. We can
look at the surrounding sentence, paragraph, document or user-specified keywords. The query nodes are extracted
from text and contain a proper document context. With this context, we use a tf-idf weighted cosine similarity
score to compare the context of each mention token. Finally, entity-wide features use all mentions inside an entity
cluster to make a decision. An example entity-wide feature counts the matching mention strings between two
entities.
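
As an illustration of the context feature, the Python sketch below computes a tf-idf weighted cosine similarity between two token lists. The function names and the `idf` dictionary are hypothetical; in the experiments the corpus statistics are computed inside the database.

```python
import math
from collections import Counter


def tfidf_vector(tokens, idf):
    """Sparse tf-idf vector: term count times inverse document frequency."""
    counts = Counter(tokens)
    return {t: c * idf.get(t, 0.0) for t, c in counts.items()}


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

The resulting score in [0, 1] would then be bucketed into one of the similarity features, each of which carries its own weight.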

Models. Features on the NYT and Wikilinks data sets were manually tuned and the features for the Rexa
data set were trained using SampleRank [41] with confidence weighted updates. We manually tune some of the
weights in the NYT Corpus to make up for the lack of complete training data. The models can be graphically
represented in the same way as the graphical models shown earlier.

Evaluation metrics. Convergence of MCMC algorithms is difficult to measure, as described in a review by
Cowles and Carlin [10]. We estimate the convergence progress by calculating the f1 score of the query node's
entity (f1_q). We create this new measure because we are primarily concerned with the query entity. Other
measures include B^3 for entity resolution and several others for general MCMC models [4, 10].

The query-specific f1 score is the harmonic mean of the query-specific recall R_q and query-specific precision
P_q. To accurately determine the P_q and R_q of each query in this experiment we label each correct query
node. Query-specific precision and query-specific recall are defined as:

$$P_q = \frac{|\{\mathrm{relevant}(\mathcal{M})\} \cap \{\mathrm{retrieved}(\mathcal{M})\}|}{|\{\mathrm{retrieved}(\mathcal{M})\}|}, \qquad R_q = \frac{|\{\mathrm{relevant}(\mathcal{M})\} \cap \{\mathrm{retrieved}(\mathcal{M})\}|}{|\{\mathrm{relevant}(\mathcal{M})\}|}$$

The f1 score for the query node's entity q is defined as:

$$f1_q = 2\,\frac{R_q P_q}{R_q + P_q}$$

The f1_q score is a good indicator of entity and answer quality. For multi-query experiments we calculate the
average f1_q scores for each query node. The run of each non-parallel algorithm is averaged over 3 to 10 runs.
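
The query-specific metric is straightforward to compute from two sets of mention identifiers. The Python sketch below is illustrative; the function name `query_f1` and the convention of returning zero when a set is empty are assumptions.

```python
def query_f1(retrieved, relevant):
    """Return (precision, recall, f1) for the query node's entity.

    retrieved: mentions currently placed in the query entity.
    relevant:  mentions labeled as truly coreferent with the query node.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2.0 * precision * recall / (precision + recall)
```

For example, if the query entity holds four mentions of which two are correct, and three mentions are labeled as relevant, then precision is 1/2, recall is 2/3 and the f1_q score is 4/7.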


Rexa data set: http://cs.neiu.edu/~culotta/data/rexa.html










Table 4: Features used on the NYT Corpus. The first set of features are token specific features, the middle set
are between pairs of mentions and the bottom set are entity-wide features.

Feature name                      Score+   Score-   Feature type
Equal mention Strings             +20      -15      Token Specific
Equal first character             +5                Token Specific
Equal second character            +3                Token Specific
Equal second character            0
Unequal mention Strings                    -15      Token Specific
Unequal first character           0
Unequal second character          0
Unequal second character          0
Equal substrings                  +30      -150     Token Specific
Unequal substrings                         -150     Token Specific
Equal string lengths              +10               Token Specific
Matching first term               +90      -3       Token Specific
No matching first term                     -3       Token Specific
Similarity > 0.99                 +120              Pairwise
Similarity > 0.90                 +105              Pairwise
Similarity > 0.80                 +80               Pairwise
Similarity > 0.70                 +55               Pairwise
Similarity > 0.60                 +35               Pairwise
Similarity > 0.50                 +15               Pairwise
Similarity > 0.40                          -5       Pairwise
Similarity > 0.30                          -50      Pairwise
Similarity > 0.20                          -80      Pairwise
Similarity < 0.20                          -100     Pairwise
Matching terms                    +20               Pairwise
Token in context                  +1                Pairwise
No matching keyword               +700     -10      Pairwise
Matching Keyword                  +700              Pairwise
Keyword in token                  +70               Pairwise
Extra Token                                -500     Pairwise
Matching token in context         +10               Pairwise
Similar neighbor                  +100     -5       Entity-wide
No Similar neighbor in entity              -5       Entity-wide
Matching document                 +350     -15      Entity-wide
No Matching documents in entity            -15      Entity-wide






Hybrid-repel queries over the NYT Corpus 



Figure 5: Hybrid-repel performance for the first 50 samples for three queries. Each result is averaged over 6 runs 


6.2 Realtime Query-Driven ER Over NYT. In this experiment we show that query-driven entity resolution
techniques allow us to obtain near realtime results on large data sets such as the NYT Corpus.

Figure 5 shows the f1_q score of the hybrid ER algorithms with three single-query ER queries. The graph
shows performance over the first 50 proposals. For example, the 'Zuckerberg' query could be expressed as shown
in Algorithm 8.


Algorithm 8 Example ER query over the entity 'zuckerberg'

SELECT *
FROM Mention m
WHERE coref_map(m.*, entity(p), 'zuckerberg', context);


Recall that a canopy is first generated using an approximate match over the mention set. We use the repel
inference function, and all the mentions are initialized in one large entity. The 'Richard Hatch' and 'Carnegie
Mellon' queries start at an f1_q score of .92 and .97, respectively. The 'Zuckerberg' query starts above .65 and
improves to an f1_q score over .8.

These experiments show the repel method removing mismatches from the query entity. The co-influence
function is used to quickly identify the mentions that do not belong in the entity, and they are proposed to be
removed. When a hybrid move is proposed, a mention is moved from the large entity to
a new, possibly empty, entity. This method relies on good repulsion features and correct weights.

In Table 5 we show the performance of three queries. In addition to the query token we add four columns:
blocking time in seconds, canopy size, inference time in seconds and the total compute time. Total time is the
complete time taken by each run; this includes building the influence data structure and result writing. The
values in Table 5 show the fast performance of query-driven ER over a large database of mentions.

6.3 Single-query ER. In this experiment we show a performance comparison between the single-query
algorithms summarized in the previous sections. We run the query-driven algorithms over queries with different
selectivity levels and show the accuracy over time. Each algorithm uses the attract method, so each mention in
the canopy starts in its own entity.


We define realtime as contributing only a small or no time loss when this process is a part of an external execution pipeline such as
an information extraction pipeline.











Table 5: The performance of the hybrid-repel ER algorithm for queries over the NYT Corpus for the first 50
samples. Total time includes the time to build the influence data structure and result output. The NYT Corpus contains
over 71 million mentions, a large number for entity resolution problems.

Query             Blocking   Mentions   Inference   Total time
Zuckerberg        24.4 s     103        2 s         37 s
Richard Hatch     28.3 s     226        18.5 s      59 s
Carnegie Mellon   25.9 s     1302       68 s        124 s


Low Selectivity Queries over Rexa 



Figure 6: A comparison of single-query algorithms on a query with selectivity of 11 


Figure 6 shows the run time of all four algorithms on the Rexa data set with the query 'Nemo Semret',
an author with a selectivity of 11. The baseline entity resolution algorithm does not get a correct
proposal until about 500 seconds. The baseline algorithm takes a long time to accept the first proposal because
it is randomly trying to insert mentions into an existing entity. Target-fixed immediately begins to make
correct proposals. Hybrid and query-proportional have the best performance and resolve the entity almost
instantaneously. The hybrid algorithm chooses the most likely nodes to merge into the query entity. As the first couple of
proposals are correct merges, hybrid quickly converges to a high accuracy. Due to imperfect features, among the
10 averaged runs a few runs get stuck at a local optimum, causing suboptimal results.

Figure 7 shows the run time of four algorithms for the query node 'A. A. Lazar' with selectivity of 46. The
baseline algorithm progresses the slowest. The hybrid algorithm quickly reaches a perfect f1_q score. The
query-proportional algorithm lags slightly behind the hybrid method but still reaches a perfect value. The target-fixed
algorithm gradually increases to a perfect f1_q score about 60 seconds after hybrid and query-proportional.

Figure 8 shows the run time of four algorithms with a query 'Michael Jordan' of selectivity 130. The baseline
slowly increases over the 100 seconds. The hybrid algorithm again quickly achieves a perfect f1_q score, followed
by query-proportional and then target-fixed. The time gap between each of the algorithms increases with the
increase in selectivity; hybrid achieves the best performance.

We look deeper at how selectivity affects the rate of convergence. In Figure 9 we show the time it takes for
each algorithm to reach an f1_q score of 0.95 over increasing selectivity. We choose five query nodes of increasing
selectivity but with the same canopy sizes. The hybrid algorithm runtime increased with the increase in selectivity
but only slightly steeper than constant. Target-fixed increased for the first three queries but did not last more
than 50 seconds. Query-proportional has only a slight increase in time till convergence for the first three queries.
The highest two selectivity queries are expensive for query-proportional and we observe an exponential increase in
runtime. These results are consistent with the exponentially large increase in the number of random comparisons
needed to find a match for a query entity. The query-proportional algorithm does not focus on the query entity
as aggressively as the target-fixed and hybrid algorithms. Recall that the target-fixed and the hybrid algorithms focus
on moving correct nodes into the query entity. Query-proportional selects candidate nodes using the influence
function but it does not fix the target entity. With the target entity not fixed, the chance of proposing a correct node for the
query entity decreases exponentially. This shows that the selectivity of nodes affects the runtime performance of each
algorithm. When performing join-driven ER it is important to take the relative selectivity of nodes into account
when choosing the best scheduling algorithm.


Medium Selectivity Queries over Rexa

Figure 7: A comparison of single-query algorithms with a query node of selectivity 46


High Selectivity Queries over Rexa

Figure 8: A comparison of selection-driven algorithms with a query node of selectivity 130

Figure 9: The time until an f1_q score of 0.95 for five queries of increasing selectivities; averaged over three runs


6.4 Multi-query ER. In this experiment we study the performance of our different scheduling algorithms for
join-driven ER queries. We choose ten query nodes of different selectivity and run the join query scheduling
algorithms described in Section 4.3. Consider a table like the People table introduced earlier, with selectivity {130,
63, 68, 7, 12, 12, 301, 11, 46}. The four algorithms, random, closest-first, farthest-first and selectivity-based, are
shown in Figure 10. The selectivity-based method outperforms the other three algorithms in terms of convergence
rate. The jumps in accuracy on the graph correspond to the scheduling algorithms choosing new query nodes and
accepting new proposals. It has a high jump when it starts sampling the seventh, and highest selectivity, node.
The farthest-first algorithm rises the slowest out of the scheduling algorithms because it tries to stop sampling the
high performing query entity and makes proposals for the slowest growing. The selectivity-based method performs well
early because the high selectivity queries are sampled first. The high selectivity query makes up a large proportion
of the total f1_q score. The large jump in the random method occurs when it reaches the node with selectivity 301.
Notice that closest-first reaches its peak f1_q score the fastest because it tries to get the most out of every query node.

6.5 Context Levels. In this experiment we aim to discover how different levels of context specified at query
time can improve convergence time and overall accuracy. We take the zuckerberg query and the hybrid-repel
algorithm and run ER three times over three levels of context. Each mention in the graph contains a 'paragraph'
level of context and we only alter the context of the query node. The 'none' context only activates token specific
features; any context features involving the query node are zeroed out. The 'paragraph' level context is the
default context from the NYT Corpus and the 'document' level context extends context to the entire news article.
Additionally, we add specific keywords from Mark Zuckerberg's DBpedia page to the 'document' and 'paragraph'
context levels. We show the performance using the repel method in Figure 11.

Adding specific keywords that activate the keyword features is the most effective method for increasing
the accuracy of query-driven ER. Query-driven methods allow a user to observe the results and add or remove
keywords for specific queries to improve the accuracy. This type of iterative improvement workflow is not feasible
with batch methods.










Scheduling on Rexa using Hybrid

Figure 10: The progress of the hybrid algorithm for multiple query nodes using different scheduling
algorithms. Each result is averaged over three runs.


Hybrid-repel on the NYT Corpus (Zuckerberg)

Figure 11: The performance of the zuckerberg query with different levels of context. Each result is averaged over 6 runs.

Figure 12: Hybrid-attract algorithm with random queries run over the Wikilinks corpus. Each plot starts after
the Vose structures are constructed.

6.6 Parallel Hybrid ER. This experiment has two objectives: first, to determine how the hybrid algorithm performs
with a canopy size of 1 million queries and, second, to determine the effect of increasing the number of query nodes. In Figure 12,
the hybrid algorithm is able to resolve entities in a short amount of time. The creation time of the Vose structures
is about linear in the number of queries. The trend in the graph is that as the ratio of queries to entities increases,
the performance benefit of the hybrid-attract method decreases. With more query nodes the construction time
increases, and the benefits of the algorithm decrease and become no better than the baseline method.

Experiment Summary. Each of the query-driven methods outperforms the baseline methods in terms of
runtime while not losing out on accuracy. Across different data set sizes, hybrid algorithms have the most consistent
performance. If a system has a quality blocking function then it is better to use the co-influence entity resolution
method. With multiple query nodes, selectivity-based is the most consistently performing algorithm. More accurate
estimation of MCMC convergence performance could allow the dynamic scheduling algorithms closest-first and
farthest-first to achieve higher accuracy. Adding more contextual information to query nodes
at query time yields higher accuracy of the entity resolution algorithms. Parallel query-driven sampling is an
effective way to get speed up in an ER data set when the ratio of mentions to entities is low.


7 Query-Driven Entity Resolution Related Work 

This chapter is related to work in several areas. In this section we describe a selection of the literature that we 
found most relevant to different parts of the Query-Driven ER task. 

Entity Resolution. The state-of-the-art method for entity resolution employs collective classification.
Instead of purely pairwise decisions, collective classification methods consider group relationships when making
clustering determinations. In a recent tutorial [14], collective classification methods were grouped into three
categories: non-probabilistic [5, 12, 20], probabilistic [8, 13, 23, 28, 29, 35] and hybrid approaches [2, 31]. A relevant
challenge proposed for entity resolution research by the tutorial is how to efficiently perform entity resolution
when a query is involved. This chapter seeks to address this issue.

Entity resolution is generally an expensive, offline batch process. Bhattacharya and Getoor proposed a
method for query-time entity resolution. This method performs inference by starting with a query node and
performing 'expand and resolve' to resolve entities through resolution of attributes and expansion of hyper-edges.
Unfortunately, hyper-edges between records are not always explicit in data sets. This chapter does not assume
the presence of any link in the corpus; each entity or mention is independently defined, which is the case for
most applications.

A recent paper by Altwaijry, Kalashnikov and Mehrotra has a similar motivation of using SQL queries to
drive entity resolution. That work focuses on using predicates in the query to drive computation while this work uses
example queries to drive computation. Both techniques are complementary, and combining the two by updating
the edge-picking policy described in their paper using our approach makes for an interesting method of optimizing
the entity resolution process.

The term query-driven appears in this chapter and has appeared in others across literature with different
meanings [16]. Our definition of a query node is an example item, or mention, from a data set. A query in Altwaijry
et al. is the set of predicates in an SQL statement. Query-driven in Grant et al. [16] refers to the SQL queries used to
drive analytics.

It is becoming increasingly common to work with extremely large data sets; in response, researchers have 
studied streaming and distributed processing. Rao, McNamee and Dredze describe an approach for streaming 
entity resolution [30]. This approach is fast and approximates entries in an LRU queue of clustered entity chains. 
We apply our techniques to a static data set and do not yet handle streams of data. Singh, Subramanya, Pereira, 
and McCallum propose a technique for ER where entities are resolved in parallel blocks and then redistributed and 
resolved again in new blocks [33]. This parallel distribution method makes large-scale entity resolution tractable. 
In this chapter, we perform analysis on a data set of similar scale, but we show that large performance gains can be 
achieved when a query is specified. 
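
The LRU-queue idea mentioned above for streaming ER can be illustrated with a minimal sketch. This is not the implementation of [30]; the function names, the token-level Jaccard similarity, the `capacity` and `threshold` parameters are all assumptions chosen for illustration. The sketch only shows why a bounded queue of recent clusters is fast but approximate: a mention whose cluster has already been evicted starts a new cluster.

```python
from collections import OrderedDict

def jaccard(a, b):
    """Token-level Jaccard similarity (an assumed, illustrative measure)."""
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def stream_resolve(mentions, capacity=2, threshold=0.5):
    """Assign each streamed mention to a cluster held in a bounded LRU cache.

    Hypothetical sketch: assumes capacity >= 1. Returns one cluster id per mention.
    """
    cache = OrderedDict()  # cluster id -> mentions, least recently used first
    assignments = []
    next_id = 0
    for mention in mentions:
        best_id, best_sim = None, threshold
        for cid, members in cache.items():
            sim = max(jaccard(mention, other) for other in members)
            if sim >= best_sim:
                best_id, best_sim = cid, sim
        if best_id is None:
            # No cached cluster is similar enough: open a new cluster,
            # evicting the least recently used one if the cache is full.
            if len(cache) >= capacity:
                cache.popitem(last=False)
            best_id = next_id
            next_id += 1
            cache[best_id] = []
        cache[best_id].append(mention)
        cache.move_to_end(best_id)  # mark the cluster as most recently used
        assignments.append(best_id)
    return assignments

stream = ["Barack Obama", "Obama Barack", "Angela Merkel", "Joe Biden", "Barack Obama"]
small = stream_resolve(stream, capacity=2)
large = stream_resolve(stream, capacity=3)
```

With the small cache, the first cluster is evicted before the last mention arrives, so that mention is placed in a fresh cluster; with the larger cache it rejoins its original cluster.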

Query-specific sampling. Recently, several researchers have explored focusing the sampling of 
graphical models to speed up inference. Below we discuss three approaches that use sampling to speed up ER 
over graphical models. 


Query-Aware MCMC [40] found that when performing a query over a graphical model, the cost of not sampling 
a node is exactly that node's influence on the query node. This lets us ignore nodes that have low 
influence on the query node while incurring only a small amount of error. The influence score can be calculated as the 
mutual information between two nodes. Because this mutual information is intractable to compute, the authors compare 
techniques for estimating it; the estimate is called the influence trail score. Because ER has a fixed pairwise model, we can use the 
theory from this work, together with specialized data structures, to gain performance during query-driven sampling. 
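
As a small illustration of influence as mutual information, the sketch below computes an empirical estimate of I(X; Y) from paired discrete observations and keeps only the nodes whose score with the query node exceeds a cutoff. It is a plug-in estimate over toy data, not the influence trail score of [40]; the function names, the sample layout and the cutoff of 0.1 are assumptions made for the example.

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Plug-in estimate of I(X; Y) in nats from a list of (x, y) observations."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), count in joint.items():
        # p(x,y) * log( p(x,y) / (p(x) p(y)) ), written with raw counts
        mi += (count / n) * math.log(count * n / (px[x] * py[y]))
    return mi

def influence_scores(query_states, node_states):
    """Score every node by its estimated mutual information with the query node."""
    return {
        name: mutual_information(list(zip(query_states, states)))
        for name, states in node_states.items()
    }

# Toy sampled states (hypothetical): node "a" mirrors the query, node "b" is independent of it.
query = [0, 1, 0, 1]
nodes = {"a": [0, 1, 0, 1], "b": [0, 0, 1, 1]}

scores = influence_scores(query, nodes)
kept = [name for name, score in scores.items() if score >= 0.1]
```

Here node "a" scores ln 2 and is kept, while node "b" scores zero and could be skipped during sampling at little cost to the query's accuracy.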

Type-based MCMC is a method of sampling groups of nodes with the same attribute to speed 
progress towards convergence [24]. This approach works well when feature sets can be tractably counted and 
grouped. If query nodes are introduced, it is not clear how one might focus type-based sampling. 

Other researchers have explored using belief propagation with queries to approximate marginals of factor 
graphs [^. However, the entity resolution graph is cyclic and highly connected. MCMC scales to large real-world 
models better than loopy belief propagation [40]. 


8 Query-Driven Entity Resolution Summary 

In this chapter, we propose new approaches for accelerating large-scale entity resolution in the common case that 
the user is interested in a single entity or a watch list of entities. These techniques can be integrated into existing data 
processing pipelines or used as a tool for exploratory data analysis. We presented three single-node ER algorithms 
and three scheduling algorithms for multi-query ER, and showed experimentally that their runtime performance is 
several orders of magnitude better than the baseline. 